16 research outputs found

    Exploiting asymmetric multi-core systems with flexible system software

    Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. These architectures combine different types of processing cores designed at different performance and power optimization points, thus exposing a performance-power trade-off. By maintaining two types of cores, AMCs are able to provide high performance under the facility power budget. However, using AMCs poses significant challenges, such as scheduling and load balancing. This thesis initially explores the potential of AMCs when executing current HPC applications and searches for the most appropriate execution model. Specifically, we evaluate several execution models on an Arm big.LITTLE AMC using the PARSEC benchmark suite, which includes representative HPC applications. We compare schedulers at the user, OS and runtime system levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system, as it improves on user-level scheduling by 23%, while the heterogeneous-aware OS scheduling solution improves on user-level scheduling by 10%. Following this outcome, this thesis focuses on increasing the performance of AMC systems by improving scheduling at the runtime system level. Scheduling at the runtime system level is provided by task-based parallel programming models. These programming models offer programming flexibility, as they consist of an interface and a runtime system that manages the underlying resources and threads. In this thesis we improve scheduling with task-based programming models by providing three novel task schedulers for AMCs. These dynamic scheduling policies reduce total execution time by detecting either the longest or the critical path of the dynamic task dependency graph of the application. 
They use dynamic scheduling and information discoverable during execution, which makes them implementable and functional without off-line profiling. In our evaluation we compare these scheduling approaches with an existing state-of-the-art heterogeneous scheduler and track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC. Another enhancement we provide in task-based programming models is adaptability to fine-grained parallelism. The increasing number of cores on modern chip multiprocessors (CMPs) is pushing research towards fine-grained workloads, which are an important challenge for task-based programming models. Our study observes that task creation becomes a bottleneck when executing fine-grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks becomes more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX minimizes task creation overheads and relies on both the runtime system and dedicated hardware. On the runtime system side, TaskGenX decouples task creation from the other runtime activities. It then transfers this part of the runtime to specialized hardware. From our evaluation using 11 HPC workloads on both symmetric and AMC systems, we obtain performance improvements of up to 15x, averaging 3.1x over the baseline. Finally, this thesis presents a showcase for a real-time CPU scheduler that aims to increase the frames per second (FPS) of gameplay on mobile devices with AMC systems. We design and implement the Real-Time Scheduler (RTS) in the Android framework. RTS provides an efficient scheduling policy that takes into account the current temperature of the system to perform task migration. 
The RTS solution increases the median FPS of the baseline mechanisms by up to 7.5% while keeping the temperature stable.

    Task scheduling techniques for asymmetric multi-core systems

    As performance and energy efficiency have become the main challenges for next-generation high-performance computing, asymmetric multi-core architectures can provide solutions to tackle these issues. Parallel programming models need to suit the needs of such systems and keep increasing the application's portability and efficiency. This paper proposes two task scheduling approaches that target asymmetric systems. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application, or by finding the earliest executor of a task. They use dynamic scheduling and information discoverable during execution, which makes them implementable and functional without off-line profiling. In our evaluation we compare these scheduling approaches with two existing state-of-the-art heterogeneous schedulers and track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45x on a real 8-core asymmetric system and by up to 2.1x on a simulated 32-core asymmetric chip.
    This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 610402 and from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement no 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).
    Peer Reviewed. Postprint (author's final draft)
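The abstract above describes schedulers that detect the longest or critical path of the task dependency graph. As a rough illustration only, the sketch below ranks tasks by their "bottom level" (the cost-weighted longest path from a task to the end of the graph) and sends the highest-ranked ready tasks to big cores. The function names, the task costs and the big/little assignment rule are all assumptions made for this example; they are not the policies evaluated in the paper.

```python
from collections import defaultdict


def bottom_levels(costs, deps):
    """Longest cost-weighted path from each task to a sink, the task included.

    costs: task -> estimated cost (hypothetical units)
    deps:  task -> list of predecessor tasks
    """
    # Invert the predecessor lists to get successors of each task.
    succs = defaultdict(list)
    for task, preds in deps.items():
        for pred in preds:
            succs[pred].append(task)

    memo = {}

    def level(task):
        if task not in memo:
            memo[task] = costs[task] + max(
                (level(s) for s in succs[task]), default=0
            )
        return memo[task]

    return {task: level(task) for task in costs}


def assign_ready(ready, levels, n_big):
    """Assumed policy: the n_big most critical ready tasks go to big cores."""
    ranked = sorted(ready, key=lambda t: levels[t], reverse=True)
    return {t: ("big" if i < n_big else "little") for i, t in enumerate(ranked)}


# Hypothetical four-task graph: a -> b -> d, and a -> c.
costs = {"a": 2, "b": 3, "c": 1, "d": 4}
deps = {"b": ["a"], "c": ["a"], "d": ["b"]}
levels = bottom_levels(costs, deps)
print(levels)
print(assign_ready(["c", "b"], levels, n_big=1))
```

In this toy graph, task `b` lies on the longest path (`a`, `b`, `d`), so it is the one placed on the single big core once `a` has finished. A real runtime would discover the graph incrementally during execution rather than receive it up front.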

    POSTER: Exploiting asymmetric multi-core processors with flexible system sofware

    Energy efficiency has become the main challenge for high performance computing (HPC). Using mobile asymmetric multi-core architectures to build future multi-core systems is one approach to saving energy while keeping performance high. However, it is not yet known whether such systems are ready to handle parallel applications. This paper fills this gap by evaluating emerging parallel applications on an asymmetric multi-core. We make use of the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture. We conclude that these applications are not mature enough to run on such systems, as they suffer from load imbalance. Furthermore, we explore the behaviour of dynamic scheduling solutions at either the Operating System (OS) or the runtime level. Comparing these approaches shows that the most efficient scheduling takes place at the runtime level, which steers future research towards such solutions.
    This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 610402 and from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement number 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).
    Peer Reviewed. Postprint (author's final draft)

    On the maturity of parallel applications for asymmetric multi-core processors

    Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. By maintaining two types of cores (fast and slow), AMCs are able to provide high performance under the facility power budget. This paper performs the first extensive evaluation of how portable current HPC applications are to such supercomputing systems. Specifically, we evaluate several execution models on an ARM big.LITTLE AMC using the PARSEC benchmark suite, which includes representative highly parallel applications. We compare schedulers at the user, OS and runtime levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place at the runtime system level, as it improves the baseline by 23%, while the heterogeneous-aware OS scheduling solution improves the baseline by 10%.
    This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 671697 and No. 779877. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104.
    Peer Reviewed. Postprint (author's final draft)
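The load-balancing problem mentioned in the abstract above can be illustrated with a toy model. The sketch below compares an equal static split of identical work items against greedy dynamic self-scheduling on a machine with four fast and four slow cores. The 2:1 speed ratio, the item count and the function names are assumptions chosen for the example; the model ignores scheduling overheads, memory effects and everything else the paper actually measures.

```python
import heapq


def static_makespan(n_items, speeds):
    """Equal split: every core gets the same number of items, whatever its speed."""
    per_core = n_items / len(speeds)
    return max(per_core / s for s in speeds)


def dynamic_makespan(n_items, speeds):
    """Greedy self-scheduling: the core that becomes free first takes the next item."""
    # Heap of (time at which the core becomes free, core index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_items):
        t, core = heapq.heappop(free_at)
        t += 1.0 / speeds[core]  # each item is one unit of work
        finish = max(finish, t)
        heapq.heappush(free_at, (t, core))
    return finish


# Assumed machine: 4 big cores twice as fast as 4 little cores.
speeds = [2.0] * 4 + [1.0] * 4
print(static_makespan(240, speeds))   # 30.0: the little cores finish last
print(dynamic_makespan(240, speeds))  # 20.0: every core stays busy until the end
```

Under these assumptions the static split is bounded by the slow cores, while the dynamic policy lets fast cores absorb more items. The gap in this toy model is not meant to reproduce the percentages reported in the abstract.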

    CATA: Criticality aware task acceleration for multicore processors

    Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power of a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, reaching average improvements of up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals not aware of task criticality.
    This work has been supported by the Spanish Government (grants SEV2015-0493 and SEV-2011-00067 of the Severo Ochoa Program), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316, TIN2012-34557, TIN2013-46957-C2-2-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 610402 and from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement no 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). E. Castillo has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2012/2254.
    Peer Reviewed. Postprint (author's final draft)
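For readers unfamiliar with the metric, Energy-Delay Product is energy multiplied by execution time, so a relative EDP change is the product of the relative energy change and the relative time change. The sketch below shows that arithmetic for a single run. The function names are illustrative, and plugging in the abstract's averaged figures is only a back-of-the-envelope exercise: averages over many benchmarks need not compose this way, so the result is not a number reported by the paper.

```python
def edp(energy_joules, delay_seconds):
    """Energy-Delay Product: lower is better."""
    return energy_joules * delay_seconds


def implied_energy_ratio(time_improvement, edp_improvement):
    """Energy ratio (new / old) implied by time and EDP improvements of one run.

    Follows from EDP_new / EDP_old = (E_new / E_old) * (T_new / T_old).
    """
    return (1.0 - edp_improvement) / (1.0 - time_improvement)


# Illustrative only: the 18.4% time and 30.1% EDP improvements are averages.
ratio = implied_energy_ratio(0.184, 0.301)
print(round(ratio, 3))  # about 0.857, i.e. roughly 14% less energy in this toy case
```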

    Exploiting asymmetric multi-core systems with flexible system software

    No full text
    Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. These architectures combine different types of processing cores designed at different performance and power optimization points, thus exposing a performance-power trade-off. By maintaining two types of cores, AMCs are able to provide high performance under the facility power budget. However, there are significant challenges when using AMCs such as scheduling and load balancing. This thesis initially explores the potential of AMCs when executing current HPC applications and searches for the most appropriate execution model. Specifically we evaluate several execution models on an Arm big.LITTLE AMC using the PARSEC benchmark suite that includes representative HPC applications. We compare schedulers at the user, OS and runtime system levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system as it improves the user-level scheduling by 23%, while the heterogeneous-aware OS scheduling solution improves the user-level scheduling by 10%. Following this outcome, this thesis focuses on increasing performance of AMC systems by improving scheduling in the runtime system level. Scheduling in the runtime system level is provided by the use of task-based parallel programming models. These programming models offer programming flexibility as they consist of an interface and a runtime system to manage the underlying resources and threads. In this thesis we improve scheduling with task-based programming models by providing three novel task schedulers for AMCs. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application. 
They use dynamic scheduling and information discoverable during execution, fact that makes them implementable and functional without the need of off-line profiling. In our evaluation we compare these scheduling approaches with an existing state-of the art heterogeneous scheduler and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC. Another enhancement we provide in task-based programming models is the adaptability to fine grained parallelism. The increasing number of cores on modern CMPs is pushing research towards the use of fine grained workloads, which is an important challenge for task-based programming models. Our study makes the observation that task creation becomes a bottleneck when executing fine grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks is becoming more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX minimizes task creation overheads and relies both on the runtime system and a dedicated hardware. On the runtime system side, TaskGenX decouples the task creation from the other runtime activities. It then transfers this part of the runtime to a specialized hardware. From our evaluation using 11 HPC workloads on both symmetric and AMC systems, we obtain performance improvements up to 15x, averaging to 3.1x over the baseline. Finally, this thesis presents a showcase for a real-time CPU scheduler with the goal to increase the frames per second (FPS) of the game-play on mobile devices with AMC systems. We design and implement the RTS scheduler in the Android framework. RTS provides an efficient scheduling policy that takes into account the current temperature of the system to perform task migration. 
The RTS solution increases the median FPS of the baseline mechanisms by up to 7.5% while keeping the temperature stable.
Asymmetric multi-core processors (AMCs) are a successful architectural solution for mobile devices and supercomputers. These architectures combine different types of processing cores designed with different performance and power properties. By maintaining two or more types of cores, AMCs can provide high performance with low infrastructure energy consumption. However, there are significant challenges when using AMCs, such as scheduling and load balancing. This thesis initially explores the potential of AMCs when executing current High Performance Computing (HPC) applications and searches for the most appropriate execution model for them. Specifically, we evaluate several execution models on an asymmetric Arm big.LITTLE processor using the PARSEC applications, which are representative HPC applications. This work compares scheduling at the user, operating system and library levels, and we evaluate the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is most effective when it is carried out at the runtime level, as it improves on user-level scheduling by 23%, while the heterogeneous OS scheduling solution improves on user-level scheduling by 10%. Following this result, this thesis focuses on increasing the performance of AMC systems by improving scheduling at the library level. Scheduling at this level is provided through the use of task-based parallel programming models. These programming models offer programming flexibility, as they consist of an interface and a runtime to manage the underlying resources and threads. 
In this thesis, we improve scheduling with task-based programming models by providing three new task schedulers for AMCs. These dynamic schedulers reduce total execution time by detecting either the longest path or the critical path of the application's task dependency graph, which is generated dynamically. In our evaluation, we compare these schedulers with an existing heterogeneous scheduler and demonstrate their improvement over a FIFO scheduler. We show that the heterogeneous schedulers improve on the FIFO scheduler by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC. Another contribution to task-based programming models is adaptability to fine-grained parallelism. The growing number of cores on modern multi-core chips is pushing research towards the use of fine-grained workloads, which is an important challenge for task-based programming models. Our study observes that task creation blocks execution when running fine-grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks becomes more critical to the entire execution. Our solution is TaskGenX, which minimizes task creation costs and relies on a runtime extension and dedicated hardware. In the runtime, TaskGenX decouples task creation from the other runtime activities and executes it on specialized hardware. We evaluate 11 HPC applications with TaskGenX on symmetric and AMC systems and obtain performance improvements of up to 15x, with an average of 3.1x over the baseline implementation. Finally, this thesis presents a CPU scheduler that aims to increase the frames per second (FPS) of games on mobile devices with AMC systems. We design and implement the Real-Time Scheduler (RTS) on Android. 
RTS provides an efficient scheduling policy that takes the current system temperature into account when performing task migration. The RTS solution increases the median FPS of the baseline mechanisms...

    Exploiting asymmetric multi-core systems with flexible system software

    No full text
    Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. These architectures combine different types of processing cores designed at different performance and power optimization points, thus exposing a performance-power trade-off. By maintaining two types of cores, AMCs are able to provide high performance under the facility power budget. However, there are significant challenges when using AMCs, such as scheduling and load balancing. This thesis initially explores the potential of AMCs when executing current HPC applications and searches for the most appropriate execution model. Specifically, we evaluate several execution models on an Arm big.LITTLE AMC using the PARSEC benchmark suite, which includes representative HPC applications. We compare schedulers at the user, OS and runtime system levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system, as it improves on user-level scheduling by 23%, while the heterogeneous-aware OS scheduling solution improves on user-level scheduling by 10%. Following this outcome, this thesis focuses on increasing the performance of AMC systems by improving scheduling at the runtime system level. Scheduling at the runtime system level is provided by the use of task-based parallel programming models. These programming models offer programming flexibility, as they consist of an interface and a runtime system to manage the underlying resources and threads. In this thesis we improve scheduling with task-based programming models by providing three novel task schedulers for AMCs. These dynamic scheduling policies reduce total execution time by detecting either the longest or the critical path of the dynamic task dependency graph of the application. 
They use dynamic scheduling and information discoverable during execution, which makes them implementable and functional without the need for off-line profiling. In our evaluation we compare these scheduling approaches with an existing state-of-the-art heterogeneous scheduler and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve on the baseline by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC. Another enhancement we provide in task-based programming models is adaptability to fine-grained parallelism. The increasing number of cores on modern CMPs is pushing research towards the use of fine-grained workloads, which is an important challenge for task-based programming models. Our study observes that task creation becomes a bottleneck when executing fine-grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks becomes more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX minimizes task creation overheads and relies on both the runtime system and dedicated hardware. On the runtime system side, TaskGenX decouples task creation from the other runtime activities. It then transfers this part of the runtime to specialized hardware. From our evaluation using 11 HPC workloads on both symmetric and AMC systems, we obtain performance improvements of up to 15x, averaging 3.1x over the baseline. Finally, this thesis presents a showcase for a real-time CPU scheduler with the goal of increasing the frames per second (FPS) of game-play on mobile devices with AMC systems. We design and implement the RTS scheduler in the Android framework. RTS provides an efficient scheduling policy that takes into account the current temperature of the system to perform task migration. 
The RTS solution increases the median FPS of the baseline mechanisms by up to 7.5% while keeping the temperature stable.
Asymmetric multi-core processors (AMCs) are a successful architectural solution for mobile devices and supercomputers. These architectures combine different types of processing cores designed with different performance and power properties. By maintaining two or more types of cores, AMCs can provide high performance with low infrastructure energy consumption. However, there are significant challenges when using AMCs, such as scheduling and load balancing. This thesis initially explores the potential of AMCs when executing current High Performance Computing (HPC) applications and searches for the most appropriate execution model for them. Specifically, we evaluate several execution models on an asymmetric Arm big.LITTLE processor using the PARSEC applications, which are representative HPC applications. This work compares scheduling at the user, operating system and library levels, and we evaluate the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is most effective when it is carried out at the runtime level, as it improves on user-level scheduling by 23%, while the heterogeneous OS scheduling solution improves on user-level scheduling by 10%. Following this result, this thesis focuses on increasing the performance of AMC systems by improving scheduling at the library level. Scheduling at this level is provided through the use of task-based parallel programming models. These programming models offer programming flexibility, as they consist of an interface and a runtime to manage the underlying resources and threads. 
In this thesis, we improve scheduling with task-based programming models by providing three new task schedulers for AMCs. These dynamic schedulers reduce total execution time by detecting either the longest path or the critical path of the application's task dependency graph, which is generated dynamically. In our evaluation, we compare these schedulers with an existing heterogeneous scheduler and demonstrate their improvement over a FIFO scheduler. We show that the heterogeneous schedulers improve on the FIFO scheduler by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC. Another contribution to task-based programming models is adaptability to fine-grained parallelism. The growing number of cores on modern multi-core chips is pushing research towards the use of fine-grained workloads, which is an important challenge for task-based programming models. Our study observes that task creation blocks execution when running fine-grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks becomes more critical to the entire execution. Our solution is TaskGenX, which minimizes task creation costs and relies on a runtime extension and dedicated hardware. In the runtime, TaskGenX decouples task creation from the other runtime activities and executes it on specialized hardware. We evaluate 11 HPC applications with TaskGenX on symmetric and AMC systems and obtain performance improvements of up to 15x, with an average of 3.1x over the baseline implementation. Finally, this thesis presents a CPU scheduler that aims to increase the frames per second (FPS) of games on mobile devices with AMC systems. We design and implement the Real-Time Scheduler (RTS) on Android. 
RTS provides an efficient scheduling policy that takes the current system temperature into account when performing task migration. The RTS solution increases the median FPS of the baseline mechanisms...